We distinguish two basic characteristics:
father   mother   name       age   gender
                  John        33   male
                  Julia       32   female
John     Julia    Jack         6   male
John     Julia    Jill         4   female
John     Julia    John jnr     2   male
                  David       45   male
                  Debbie      42   female
David    Debbie   Donald      16   male
David    Debbie   Dianne      12   female
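The family data above can be represented in R as a data.frame, with NA marking the missing parent entries (a minimal sketch; the object name persons is our own):

```r
# a sketch: the family data above as an R data.frame (NA marks missing entries)
persons <- data.frame(
  father = c(NA, NA, "John", "John", "John", NA, NA, "David", "David"),
  mother = c(NA, NA, "Julia", "Julia", "Julia", NA, NA, "Debbie", "Debbie"),
  name   = c("John", "Julia", "Jack", "Jill", "John jnr",
             "David", "Debbie", "Donald", "Dianne"),
  age    = c(33, 32, 6, 4, 2, 45, 42, 16, 12),
  gender = c("male", "female", "male", "female", "male",
             "male", "female", "male", "female")
)
# children of John and Julia
persons[which(persons$father == "John"), "name"]  # "Jack" "Jill" "John jnr"
```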
Recall the COVID-19 data you have worked with as part of the exercises. We can store this data in the form of a CSV file as illustrated below. Commas separate columns, new lines separate rows, and the first row contains the column/variable names.
dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2019,continentExp,Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
14/10/2020,14,10,2020,66,0,Afghanistan,AF,AFG,38041757,Asia,1.94523087
13/10/2020,13,10,2020,129,3,Afghanistan,AF,AFG,38041757,Asia,1.81116766
12/10/2020,12,10,2020,96,4,Afghanistan,AF,AFG,38041757,Asia,1.50361089
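A file like this can be imported into R with read.csv(). The sketch below reads the three rows shown above from an inline text connection rather than from a file, so no file name is assumed:

```r
# a sketch: import the CSV rows shown above (normally you would pass a file path)
csv_lines <- c(
  "dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2019,continentExp,Cumulative_number_for_14_days_of_COVID-19_cases_per_100000",
  "14/10/2020,14,10,2020,66,0,Afghanistan,AF,AFG,38041757,Asia,1.94523087",
  "13/10/2020,13,10,2020,129,3,Afghanistan,AF,AFG,38041757,Asia,1.81116766",
  "12/10/2020,12,10,2020,96,4,Afghanistan,AF,AFG,38041757,Asia,1.50361089"
)
covid <- read.csv(textConnection(csv_lines))
dim(covid)  # 3 rows, 12 columns
```

Note that read.csv() converts characters that are not allowed in R variable names (such as the hyphen in COVID-19) to dots.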
The same data can also be stored as an XML file. The first few lines of this file could look like this:
<records>
<record>
<dateRep>14/10/2020</dateRep>
<day>14</day>
<month>10</month>
<year>2020</year>
<cases>66</cases>
<deaths>0</deaths>
<countriesAndTerritories>Afghanistan</countriesAndTerritories>
<geoId>AF</geoId>
<countryterritoryCode>AFG</countryterritoryCode>
<popData2019>38041757</popData2019>
<continentExp>Asia</continentExp>
<Cumulative_number_for_14_days_of_COVID-19_cases_per_100000>1.94523087</Cumulative_number_for_14_days_of_COVID-19_cases_per_100000>
</record>
<record>
<dateRep>13/10/2020</dateRep>
...
</records>
Tags (the characters <, >, and /) give the data structure, and values are nested between an opening and a closing tag: <variablename>value</variablename>. For example, the entire COVID-19 dataset content we know from the CSV example above is nested between the 'records' tags:
<records>
...
</records>
records is the "root element" of the XML document. records contains several record elements, which in turn contain several tags/variables describing a unique record (such as year).
There are two principal ways to link variable names and data values in XML:
1. Tag-based: <variablename>value</variablename>. In the example below: <filename>ISCCPMonthly_avg.nc</filename>.
2. Attribute-based: <observation variablename="value">. In the example below: <case date="16-JAN-1994" temperature="9.200012" />.
The following example uses both approaches:
<variable>Monthly Surface Clear-sky Temperature (ISCCP) (Celsius)</variable>
<filename>ISCCPMonthly_avg.nc</filename>
<filepath>/usr/local/fer_data/data/</filepath>
<badflag>-1.E+34</badflag>
<subset>48 points (TIME)</subset>
<longitude>123.8W(-123.8)</longitude>
<latitude>48.8S</latitude>
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />
The same information can be stored either way, as the following example shows:
Attributes-based:
<case date="16-JAN-1994" temperature="9.200012" />
<case date="16-FEB-1994" temperature="10.70001" />
<case date="16-MAR-1994" temperature="7.5" />
<case date="16-APR-1994" temperature="8.100006" />
Tag-based:
<cases>
<case>
<date>16-JAN-1994</date>
<temperature>9.200012</temperature>
</case>
<case>
<date>16-FEB-1994</date>
<temperature>10.70001</temperature>
</case>
<case>
<date>16-MAR-1994</date>
<temperature>7.5</temperature>
</case>
<case>
<date>16-APR-1994</date>
<temperature>8.100006</temperature>
</case>
</cases>
Note the key differences between storing data in XML format and storing it in a flat, table-like format such as CSV:
Potential drawback of XML: inefficient storage:
The following two data samples show the same information once stored in an XML file and once in a JSON file:
XML:
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<age>25</age>
<address>
<streetAddress>21 2nd Street</streetAddress>
<city>New York</city>
<state>NY</state>
<postalCode>10021</postalCode>
</address>
<phoneNumber>
<type>home</type>
<number>212 555-1234</number>
</phoneNumber>
<phoneNumber>
<type>fax</type>
<number>646 555-4567</number>
</phoneNumber>
<gender>
<type>male</type>
</gender>
</person>
JSON:
{"firstName": "John",
"lastName": "Smith",
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021"
},
"phoneNumber": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "fax",
"number": "646 555-4567"
}
],
"gender": {
"type": "male"
}
}
Data structured according to either XML or JSON syntax can be thought of as following a tree-like structure.
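In R, JSON data like the example above can be parsed with the jsonlite package (assumed to be installed); fromJSON() maps JSON objects to named lists and data frames:

```r
# a sketch: parse the JSON example above with jsonlite (package assumed installed)
library(jsonlite)
json_str <- '{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {"streetAddress": "21 2nd Street", "city": "New York",
              "state": "NY", "postalCode": "10021"},
  "phoneNumber": [{"type": "home", "number": "212 555-1234"},
                  {"type": "fax",  "number": "646 555-4567"}]
}'
person <- fromJSON(json_str)
person$firstName     # "John"
person$address$city  # "New York"
```

Note how the nested JSON objects become nested list elements that can be accessed with `$`.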
HyperText Markup Language (HTML) is designed to be read and rendered by a web browser. Yet, web pages (HTML documents) also contain tables, raw text, and images, and thus HTML is also a file format that stores data.
The following short HTML file constitutes a very simple web page:
<!DOCTYPE html>
<html>
<head>
<title>hello, world</title>
</head>
<body>
<h2> hello, world </h2>
</body>
</html>
The head and body elements are nested within the html document: <html>...</html> contains <head>...</head> and <body>...</body>. Within head, we define the page title, etc. This nested structure is commonly visualized as an HTML (DOM) tree diagram.
In this example, we look at Wikipedia’s Economy of Switzerland page.
swiss_econ <- readLines("https://en.wikipedia.org/wiki/Economy_of_Switzerland")
## Warning in readLines("https://en.wikipedia.org/wiki/Economy_of_Switzerland"): incomplete final line
## found on 'https://en.wikipedia.org/wiki/Economy_of_Switzerland'
Look at the first few imported lines:
head(swiss_econ)
## [1] "<!DOCTYPE html>"
## [2] "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">"
## [3] "<head>"
## [4] "<meta charset=\"UTF-8\"/>"
## [5] "<title>Economy of Switzerland - Wikipedia</title>"
## [6] "<script>document.documentElement.className=\"client-js\";RLCONF={\"wgBreakFrames\":false,\"wgSeparatorTransformTable\":[\"\",\"\"],\"wgDigitTransformTable\":[\"\",\"\"],\"wgDefaultDateFormat\":\"dmy\",\"wgMonthNames\":[\"\",\"January\",\"February\",\"March\",\"April\",\"May\",\"June\",\"July\",\"August\",\"September\",\"October\",\"November\",\"December\"],\"wgRequestId\":\"09ea6a9d-394e-40c1-a96f-725cdd9a7403\",\"wgCSPNonce\":false,\"wgCanonicalNamespace\":\"\",\"wgCanonicalSpecialPageName\":false,\"wgNamespaceNumber\":0,\"wgPageName\":\"Economy_of_Switzerland\",\"wgTitle\":\"Economy of Switzerland\",\"wgCurRevisionId\":1115528632,\"wgRevisionId\":1115528632,\"wgArticleId\":27465,\"wgIsArticle\":true,\"wgIsRedirect\":false,\"wgAction\":\"view\",\"wgUserName\":null,\"wgUserGroups\":[\"*\"],\"wgCategories\":[\"CS1 maint: archived copy as title\",\"CS1 German-language sources (de)\",\"Articles with German-language sources (de)\",\"Webarchive template wayback links\",\"Articles with French-language sources (fr)\",\"Articles with short description\",\"Short description is different from Wikidata\","
Select specific lines (select specific parts of the data):
swiss_econ[231]
## [1] "<th>US Dollar Exchange"
# install package if not yet installed
# install.packages("rvest")
# load the package
library(rvest)
# parse the webpage, show the content
swiss_econ_parsed <- read_html("https://en.wikipedia.org/wiki/Economy_of_Switzerland")
swiss_econ_parsed
## {html_document}
## <html class="client-nojs" lang="en" dir="ltr">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="U ...
## [2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-Eco ...
Now we can easily separate the data/text from the HTML code. For example, we can extract the HTML table containing the data we are interested in as a data.frame.
tab_node <- html_node(swiss_econ_parsed,
xpath = "//*[@id='mw-content-text']/div/table[2]")
tab <- html_table(tab_node)
tab
## # A tibble: 19 × 3
## Year `GDP (billions of CHF)` `US Dollar Exchange`
## <int> <int> <chr>
## 1 1980 184 1.67 Francs
## 2 1985 244 2.43 Francs
## 3 1990 331 1.38 Francs
## 4 1995 374 1.18 Francs
## 5 2000 422 1.68 Francs
## 6 2005 464 1.24 Francs
## 7 2006 491 1.25 Francs
## 8 2007 521 1.20 Francs
## 9 2008 547 1.08 Francs
## 10 2009 535 1.09 Francs
## 11 2010 546 1.04 Francs
## 12 2011 659 0.89 Francs
## 13 2012 632 0.94 Francs
## 14 2013 635 0.93 Francs
## 15 2014 644 0.92 Francs
## 16 2015 646 0.96 Francs
## 17 2016 659 0.98 Francs
## 18 2017 668 1.01 Francs
## 19 2018 694 1.00 Francs
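Note that the scraped US Dollar Exchange column arrives as character strings such as "1.67 Francs". A minimal sketch of converting such values to numeric (the object names below are our own, and a small sample vector stands in for the scraped column):

```r
# a sketch: strip the " Francs" suffix and convert the scraped values to numeric
exchange_chr <- c("1.67 Francs", "2.43 Francs", "1.38 Francs")  # sample values as scraped
exchange_num <- as.numeric(gsub(" Francs", "", exchange_chr))
exchange_num  # 1.67 2.43 1.38
```

Applied to the table above, the same pattern would be `as.numeric(gsub(" Francs", "", tab$`US Dollar Exchange`))`.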
Below, we walk through the first few steps of a text analysis/natural language processing (NLP) pipeline.
The quanteda package is the most complete, go-to package for text analysis in R. In order to run quanteda, several packages need to be installed. You can use the following command to make sure that missing packages are installed.
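For example (a sketch; the companion package names beyond quanteda itself are assumptions about what is used below):

```r
# a sketch: install quanteda and commonly used companion packages if missing
pkgs <- c("quanteda", "quanteda.textstats", "quanteda.textplots", "readtext")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
```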
The basic raw material of quantitative text analysis is a corpus. In NLP, a corpus is a collection of authentic text organized into datasets.
In the specific case of quanteda, a corpus is a data frame consisting of a character vector for documents, and additional vectors for document-level variables. In other words, a corpus is a data frame that contains, in each row, a text document, and additional columns with the corresponding metadata about the text.
In the examples below, we will use the inauguration corpus from quanteda, a standard corpus used in introductory text analysis. It contains the first five US presidential inaugural addresses (1789–1805). This text data can be loaded from the readtext package. The metadata of this corpus consists of the year of the inauguration and the name of the president taking office.
# set path
path_data <- system.file("extdata/", package = "readtext")
# import csv file
dat_inaug <- read.csv(paste0(path_data, "/csv/inaugCorpus.csv"))
names(dat_inaug)
## [1] "texts" "Year" "President" "FirstName"
# Create a corpus (requires the quanteda package)
library(quanteda)
corp <- corpus(dat_inaug, text_field = "texts")
print(corp)
## Corpus consisting of 5 documents and 3 docvars.
## text1 :
## "Fellow-Citizens of the Senate and of the House of Representa..."
##
## text2 :
## "Fellow citizens, I am again called upon by the voice of my c..."
##
## text3 :
## "When it was first perceived, in early times, that no middle ..."
##
## text4 :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
##
## text5 :
## "Proceeding, fellow citizens, to that qualification which the..."
# Look at the metadata in the corpus using `docvars`
docvars(corp)
## Year President FirstName
## 1 1789 Washington George
## 2 1793 Washington George
## 3 1797 Adams John
## 4 1801 Jefferson Thomas
## 5 1805 Jefferson Thomas
# In quanteda, the metadata in a corpus can be handled like data frames.
docvars(corp, field = "Century") <- floor(docvars(corp, field = "Year") / 100) + 1
Once we have a corpus, we want to extract the substance of the text.
This means, in quanteda language, that we want to extract
tokens, i.e. to isolate the elements that constitute a
corpus in order to quantify them. Basically, tokens are expressions that
form the building blocks of the text. Tokens can be single words or
phrases (several subsequent words, so-called N-grams).
toks <- tokens(corp)
head(toks[[1]], 20)
## [1] "Fellow-Citizens" "of" "the" "Senate" "and"
## [6] "of" "the" "House" "of" "Representatives"
## [11] ":" "Among" "the" "vicissitudes" "incident"
## [16] "to" "life" "no" "event" "could"
# Remove punctuation
toks <- tokens(corp, remove_punct = TRUE)
head(toks[[1]], 20)
## [1] "Fellow-Citizens" "of" "the" "Senate" "and"
## [6] "of" "the" "House" "of" "Representatives"
## [11] "Among" "the" "vicissitudes" "incident" "to"
## [16] "life" "no" "event" "could" "have"
# Remove stopwords
stopwords("en")
## [1] "i" "me" "my" "myself" "we" "our" "ours"
## [8] "ourselves" "you" "your" "yours" "yourself" "yourselves" "he"
## [15] "him" "his" "himself" "she" "her" "hers" "herself"
## [22] "it" "its" "itself" "they" "them" "their" "theirs"
## [29] "themselves" "what" "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are" "was" "were"
## [43] "be" "been" "being" "have" "has" "had" "having"
## [50] "do" "does" "did" "doing" "would" "should" "could"
## [57] "ought" "i'm" "you're" "he's" "she's" "it's" "we're"
## [64] "they're" "i've" "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll" "you'll" "he'll"
## [78] "she'll" "we'll" "they'll" "isn't" "aren't" "wasn't" "weren't"
## [85] "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't" "won't"
## [92] "wouldn't" "shan't" "shouldn't" "can't" "cannot" "couldn't" "mustn't"
## [99] "let's" "that's" "who's" "what's" "here's" "there's" "when's"
## [106] "where's" "why's" "how's" "a" "an" "the" "and"
## [113] "but" "if" "or" "because" "as" "until" "while"
## [120] "of" "at" "by" "for" "with" "about" "against"
## [127] "between" "into" "through" "during" "before" "after" "above"
## [134] "below" "to" "from" "up" "down" "in" "out"
## [141] "on" "off" "over" "under" "again" "further" "then"
## [148] "once" "here" "there" "when" "where" "why" "how"
## [155] "all" "any" "both" "each" "few" "more" "most"
## [162] "other" "some" "such" "no" "nor" "not" "only"
## [169] "own" "same" "so" "than" "too" "very" "will"
toks <- tokens_remove(toks, pattern = stopwords("en"))
head(toks[[1]], 20)
## [1] "Fellow-Citizens" "Senate" "House" "Representatives" "Among"
## [6] "vicissitudes" "incident" "life" "event" "filled"
## [11] "greater" "anxieties" "notification" "transmitted" "order"
## [16] "received" "14th" "day" "present" "month"
# We can keep words we are interested in
tokens_select(toks, pattern = c("peace", "war", "great*", "unit*"))
## Tokens consisting of 5 documents and 4 docvars.
## text1 :
## [1] "greater" "United" "Great" "United" "united" "great" "great" "united"
##
## text2 :
## [1] "united"
##
## text3 :
## [1] "war" "great" "United" "great" "great" "peace" "great" "peace" "peace" "United"
## [11] "peace" "peace"
## [ ... and 2 more ]
##
## text4 :
## [1] "greatness" "unite" "unite" "greater" "peace" "peace" "peace" "war"
## [9] "peace" "greatest" "greatest" "great"
## [ ... and 1 more ]
##
## text5 :
## [1] "United" "peace" "great" "war" "war" "War" "peace" "peace" "peace"
# Remove "fellow" and "citizen"
toks <- tokens_remove(toks, pattern = c(
"fellow*",
"citizen*",
"senate",
"house",
"representative*",
"constitution"
))
# Build N-grams (bigrams and trigrams)
toks_ngrams <- tokens_ngrams(toks, n = 2:3)
# Build N-grams based on a structure: keep n-grams that start with "never"
toks_neg_bigram_select <- tokens_select(toks_ngrams, pattern = phrase("never_*"))
head(toks_neg_bigram_select[[1]], 30)
## [1] "never_hear" "never_expected" "never_hear_veneration" "never_expected_nation"
To create a document-feature matrix (dfm), quanteda's variant of a document-term matrix (dtm), we can use the dfm command, as shown below.
dfmat <- dfm(toks)
print(dfmat)
## Document-feature matrix of: 5 documents, 1,817 features (72.31% sparse) and 4 docvars.
## features
## docs among vicissitudes incident life event filled greater anxieties notification transmitted
## text1 1 1 1 1 2 1 1 1 1 1
## text2 0 0 0 0 0 0 0 0 0 0
## text3 4 0 0 2 0 0 0 0 0 0
## text4 1 0 0 1 0 0 1 0 0 0
## text5 7 0 0 2 0 0 0 0 0 0
## [ reached max_nfeat ... 1,807 more features ]
dfmat <- dfm(toks)
dfmat <- dfm_trim(dfmat, min_termfreq = 2) # remove tokens that appear less than 2 times
topfeatures(dfmat, 10)
## government may public can people shall country every us
## 40 38 30 27 27 23 22 20 20
## nations
## 18
# compute word frequencies (requires quanteda.textstats)
library(quanteda.textstats)
tstat_freq <- textstat_frequency(dfmat, n = 5)
# visualize frequencies in a word cloud (requires quanteda.textplots)
library(quanteda.textplots)
textplot_wordcloud(dfmat, max_words = 100)
# Load two common packages
library(raster) # for raster images
## Loading required package: sp
library(magick) # for wide range of raster and vector-based images
## Linking to ImageMagick 6.9.11.60
## Enabled features: fontconfig, freetype, fftw, heic, lcms, pango, webp, x11
## Disabled features: cairo, ghostscript, raw, rsvg
## Using 12 threads
We can generate images directly in R by populating arrays and saving the plots to disk.
# Step 1: Define the width and height of the image
width = 300
height = 300
# Step 2: Define the number of layers (RGB = 3)
layers = 3
# Step 3: Generate three matrices corresponding to Red, Green, and Blue values
red = matrix(255, nrow = height, ncol = width)
green = matrix(0, nrow = height, ncol = width)
blue = matrix(0, nrow = height, ncol = width)
# Step 4: Generate an array by combining the three matrices
image.array = array(c(red, green, blue), dim = c(width, height, layers))
dim(image.array)
## [1] 300 300 3
# Step 5: Create RasterBrick
image = brick(image.array)
print(image)
## class : RasterBrick
## dimensions : 300, 300, 90000, 3 (nrow, ncol, ncell, nlayers)
## resolution : 0.003333333, 0.003333333 (x, y)
## extent : 0, 1, 0, 1 (xmin, xmax, ymin, ymax)
## crs : NA
## source : memory
## names : layer.1, layer.2, layer.3
## min values : 255, 0, 0
## max values : 255, 0, 0
# Step 6: Plot RGB
plotRGB(image)
## Warning in .couldBeLonLat(x, warnings = warnings): CRS is NA. Assuming it is longitude/latitude
# Step 7: (Optional) Save to disk
png(filename = "red.png", width = width, height = height, units = "px")
plotRGB(image)
## Warning in .couldBeLonLat(x, warnings = warnings): CRS is NA. Assuming it is longitude/latitude
dev.off()
## png
## 2
# Common Packages for Vector Files
library(xml2)
# Download and read svg image from url
URL <- "https://upload.wikimedia.org/wikipedia/commons/1/1b/R_logo.svg"
Rlogo_xml <- read_xml(URL)
# Data structure
Rlogo_xml
## {xml_document}
## <svg preserveAspectRatio="xMidYMid" width="724" height="561" viewBox="0 0 724 561" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
## [1] <defs>\n <linearGradient id="gradientFill-1" x1="0" x2="1" y1="0" y2="1" gradientUnits="obje ...
## [2] <path d="M361.453,485.937 C162.329,485.937 0.906,377.828 0.906,244.469 C0.906,111.109 162.329 ...
## [3] <path d="M550.000,377.000 C550.000,377.000 571.822,383.585 584.500,390.000 C588.899,392.226 5 ...
xml_structure(Rlogo_xml)
## <svg [preserveAspectRatio, width, height, viewBox, xmlns, xmlns:xlink]>
## <defs>
## <linearGradient [id, x1, x2, y1, y2, gradientUnits, spreadMethod]>
## <stop [offset, stop-color, stop-opacity]>
## <stop [offset, stop-color, stop-opacity]>
## <linearGradient [id, x1, x2, y1, y2, gradientUnits, spreadMethod]>
## <stop [offset, stop-color, stop-opacity]>
## <stop [offset, stop-color, stop-opacity]>
## <path [d, fill, fill-rule]>
## <path [d, fill, fill-rule]>
# Raw data
Rlogo_text <- as.character(Rlogo_xml)
# Plot
svg_img = image_read_svg(Rlogo_text)
image_info(svg_img)
## # A tibble: 1 × 7
## format width height colorspace matte filesize density
## <chr> <int> <int> <chr> <lgl> <int> <chr>
## 1 PNG 724 561 sRGB TRUE 0 72x72
svg_img